ABSTRACT



A remote data shadowing system provides storage based, 
real time disaster recovery capability. Record updates at a 
primary site cause write I/O operations in a storage sub- 
system therein. The write I/O operations are time stamped 
and the time, sequence, and physical locations of the record 
updates are collected in a primary data mover. The primary 
data mover groups sets of the record updates and associated 
control information based upon a predetermined time 
interval, the primary data mover appending a prefix header 
to the record updates thereby forming self describing record
sets. The self describing record sets are transmitted to a 
remote secondary site wherein consistency groups are 
formed such that the record updates are ordered so that the 
record updates can be shadowed in an order consistent with 
the order the record updates cause write I/O operations at the 
primary site. 

21 Claims, 14 Drawing Sheets 
FORMING CONSISTENCY GROUPS USING
SELF-DESCRIBING RECORD SETS FOR 
REMOTE DATA DUPLEXING 

This is a continuation of U.S. application Ser. No.
08/199,444, filed Feb. 22, 1994, now abandoned.

FIELD OF THE INVENTION 

The present invention relates generally to disaster recov- 
ery techniques, and more particularly, to a system for 
real-time remote copying of direct access storage device 
(DASD) data. 

BACKGROUND OF THE INVENTION

Data processing systems, in conjunction with processing 
data, typically are required to store large amounts of data (or 
records), which data can be efficiently accessed, modified, 
and re-stored. Data storage is typically separated into several 
different levels, or hierarchically, in order to provide efficient 
and cost effective data storage. A first, or highest level of
data storage involves electronic memory, usually dynamic or 
static random access memory (DRAM or SRAM). Elec- 
tronic memories take the form of semiconductor integrated 
circuits wherein millions of bytes of data can be stored on 
each circuit, with access to such bytes of data measured in 
nanoseconds. The electronic memory provides the fastest
access to data since access is entirely electronic.

A second level of data storage usually involves direct 
access storage devices (DASD). DASD storage, for 
example, can comprise magnetic and/or optical disks, which 
store bits of data as micrometer sized magnetically or optically
altered spots on a disk surface for representing the "ones"
and "zeros" that make up those bits of the data. Magnetic
DASD includes one or more disks that are coated with
remnant magnetic material. The disks are rotatably mounted
within a protected environment. Each disk is divided into
many concentric tracks, or closely spaced circles. The data 
is stored serially, bit by bit, along each track. An access 
mechanism, known as a head disk assembly (HDA), typi- 
cally includes one or more read/write heads, and is provided 
in each DASD for moving across the tracks to transfer the 
data to and from the surface of the disks as the disks are 
rotated past the read/write heads. DASDs can store gigabytes
of data with the access to such data typically measured
in milliseconds (orders of magnitude slower than electronic
memory). Access to data stored on DASD is slower
due to the need to physically position the disk and HDA to 
the desired data storage location. 

A third or lower level of data storage includes tape and/or 
tape and DASD libraries. At this storage level access to data 
is much slower in a library since a robot is necessary to 
select and load the needed data storage medium. The advan- 
tage is reduced cost for very large data storage capabilities, 
for example, terabytes of data storage. Tape storage is often
used for back-up purposes, that is, data stored at the second
level of the hierarchy is reproduced for safe keeping on
magnetic tape. Access to data stored on tape and/or in a
library is presently on the order of seconds.

Having a back-up data copy is mandatory for many 
businesses as data loss could be catastrophic to the business. 
The time required to recover data lost at the primary storage 
level is also an important recovery consideration. An
improvement in speed over tape or library back-up is
dual copy. An example of dual copy involves providing
additional DASDs so that data is written to the additional
DASDs (sometimes referred to as mirroring). Then if the
primary DASDs fail, the secondary DASDs can be
depended upon for data. A drawback to this approach is that 
the number of required DASDs is doubled. 
Another data back-up alternative that overcomes the need
to provide double the storage devices involves writing data
to a redundant array of inexpensive devices (RAID) configuration.
In this instance, the data is written such that the
data is apportioned amongst many DASDs. If a single
DASD fails, then the lost data can be recovered by using the
remaining data and error correction procedures. Currently
there are several different RAID configurations available.
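
As a rough illustration of how lost data can be rebuilt from the remaining data, the sketch below uses simple XOR parity across equal-sized blocks. XOR parity is only one of several possible error correction procedures, and the function names xor_blocks and rebuild_missing, as well as the sample block contents, are assumptions made for this example rather than part of any particular RAID product.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte strings."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Hypothetical data spread over three devices plus one parity device.
data = [b"DASD", b"data", b"1234"]
parity = xor_blocks(data)

# Simulate losing the second device and rebuilding its block.
recovered = rebuild_missing([data[0], data[2]], parity)
```

Because XOR is its own inverse, combining the parity block with every surviving block cancels their contributions and leaves only the missing block. This works for exactly one lost block per parity group.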

The aforementioned back-up solutions are generally sufficient
to recover data in the event that a storage device or
medium fails. These back-up methods are useful only for
device failures since the secondary data is a mirror of the
primary data, that is, the secondary data has the same
volume serial numbers (VOLSERs) and DASD addresses as
the primary data. System failure recovery, on the other hand,
is not available using mirrored secondary data. Hence still
further protection is required for recovering data if a disaster
occurs destroying the entire system or even the site, for
example, earthquakes, fires, explosions, hurricanes, etc.
Disaster recovery requires that the secondary copy of data be
stored at a location remote from the primary data. A known
method of providing disaster protection is to back-up data to
tape, on a daily or weekly basis, etc. The tape is then picked
up by a vehicle and taken to a secure storage area usually
some kilometers away from the primary data location. A
problem is presented in this back-up plan in that it could take
days to retrieve the back-up data, and meanwhile several
hours or even days of data could be lost, or worse, the storage
location could be destroyed by the same disaster. A somewhat
improved back-up method would be to transmit data to
a back-up location each night. This allows the data to be
stored at a more remote location. Again, some data may be
lost between back-ups since back-up does not occur
continuously, as in the dual copy solution. Hence, a substantial
data amount could be lost which may be unacceptable
to some users.

More recently introduced data disaster recovery solutions
include remote dual copy wherein data is backed-up not only
remotely, but also continuously. In order to communicate
duplexed data from one host processor to another host
processor, or from one storage controller to another storage
controller, or some combination thereof, a substantial
amount of control data is required for realizing the process.
A high overhead, however, can interfere with a secondary
site's ability to keep up with a primary site's processing,
thus threatening the ability of the secondary site to be able
to recover the primary in the event a disaster occurs.

Accordingly it is desired to provide a method and apparatus
for providing a real time update of data consistent with
the data at a primary processing location using minimal
control data, wherein the method and apparatus operates
independently of a particular application data being
recovered, that is, generic storage media based rather than
specific application data based.

SUMMARY OF THE INVENTION 

An object of the present invention is to provide an 
improved design and method for shadowing DASD data to 
a secondary site for disaster recovery. 

According to a first embodiment of the present invention,
a method for forming consistency groups provides for
disaster recovery capability from a remote site. Data updates 
generated by one or more applications running in a primary 
processor are received by a primary storage subsystem,
wherein the primary storage subsystem causes I/O write
operations to write each data update therein. The primary
storage subsystem is synchronized by a common timer, and
a secondary system, remote from the primary processor,
shadows the data updates in sequence consistent order such
that the secondary site is available for disaster recovery
purposes. The method comprising steps of: (a) time stamping
each write I/O operation in the primary storage
subsystem; (b) capturing write I/O operation record set
information from the primary storage subsystem for each
data update; (c) generating self describing record sets from
the data updates and the respective record set information,
such that the self describing record sets are sufficient to
re-create a sequence of the write I/O operations; (d) grouping
the self describing record sets into interval groups based
upon a predetermined interval threshold; and (e) selecting a
first consistency group as that interval group of self describing
record sets having an earliest operational time stamp, the
individual data updates being ordered within the first consistency
group based upon time sequences of the I/O write
operations in the primary storage subsystem.

In another embodiment of the present invention, a primary
system has a primary processor running one or more
applications, wherein the applications generating record
updates, and the primary processor generating self describing
record sets therefrom. Each self describing record set is
sent to a secondary system remote from the primary system,
wherein the secondary system shadows the record updates in
sequence consistent order based upon the self describing
record sets for real time disaster recovery purposes. The
primary processor is coupled to a primary storage subsystem
wherein the primary storage subsystem receives the record
updates and causes I/O write operations for storing each
record update therein. The primary processor comprises a
sysplex timer for providing a common time source to the
applications and to the primary storage subsystem for synchronization
purposes, and a primary data mover, synchronized
by the sysplex timer, prompts the primary storage
subsystem for providing record set information to the primary
data mover for each record update. The primary data
mover groups a plurality of record updates and each corresponding
record set information into time interval groups,
and inserts a prefix header thereto. Each time interval group
forms the self describing record sets.

The foregoing and other objects, features, and advantages
of the invention will be apparent from the following more
particular description of a preferred embodiment of the
invention, as illustrated in the accompanying drawing.

DESCRIPTION OF THE FIGURES

FIG. 1 is a block diagram of a disaster recovery system
having synchronous remote data shadowing capabilities.

FIG. 2 is a flow diagram of a method for providing
synchronous remote copy according to the disaster recovery
system of FIG. 1.

FIG. 3 is a flow diagram of a method of an I/O error
recovery program (I/O ERP) operation.

FIG. 4 is a block diagram of a disaster recovery system
having asynchronous remote data shadowing capabilities.

FIG. 5 is a data format diagram showing a prefix header
that prefixes read record sets from the primary site of FIG. 4.

FIG. 6 is a data format diagram describing fields making
up a read record set.

FIG. 7 is a state table identifying volume configuration
information.

FIG. 8 is a master journal as used by the secondary site of
FIG. 4.

FIG. 9 is an example sequence for forming a consistency
group.

FIG. 10 is a flow diagram showing a method of collecting
information and read record sets for forming consistency
groups.

FIG. 11 is a flow diagram showing a method of forming
consistency groups.

FIG. 12 is a table indicating full consistency group
recovery rules application for an ECKD architecture device
for a given sequence of I/O operations to a DASD track.

FIGS. 13A-13B is a description of the rules to be used in
the table of FIG. 12.

FIG. 14 is a flow diagram of a method of writing read
record set copies to a secondary site with full consistency
recovery capability.

DETAILED DESCRIPTION

A typical data processing system may take the form of a
host processor, such as an IBM System/360 or IBM System/370
processor for computing and manipulating data, and
running, for example, data facility storage management
subsystem/multiple virtual systems (DFSMS/MVS)
software, having at least one IBM 3990 storage controller
attached thereto, the storage controller comprising a
memory controller and one or more cache memory types
incorporated therein. The storage controller is further connected
to a group of direct access storage devices (DASDs)
such as IBM 3380 or 3390 DASDs. While the host processor
provides substantial computing power, the storage controller
provides the necessary functions to efficiently transfer,
stage/destage, convert and generally access large databases.

Disaster recovery protection for the typical data processing
system requires that primary data stored on primary
DASDs be backed-up at a secondary or remote location. The
distance separating the primary and secondary locations
depends upon the level of risk acceptable to the user, and can
vary from several kilometers to thousands of kilometers.
The secondary or remote location, in addition to providing
a back-up data copy, must also have enough system information
to take over processing for the primary system
should the primary system become disabled. This is due in
part because a single storage controller does not write data
to both primary and secondary DASD strings at the primary
and secondary sites. Instead, the primary data is stored on a
primary DASD string attached to a primary storage controller
while the secondary data is stored on a secondary DASD
string attached to a secondary storage controller.

The secondary site must not only be sufficiently remote
from the primary site, but must also be able to back-up
primary data in real time. The secondary site needs to
back-up primary data as the primary data is updated with
some minimal delay. Additionally, the secondary site has to
back-up the primary data regardless of the application
program (e.g., IMS, DB2) running at the primary site and
generating the data and/or updates. A difficult task required
of the secondary site is that the secondary data must be order
consistent, that is, secondary data is copied in the same
sequential order as the primary data (sequential consistency)
which requires substantial systems considerations. Sequential
consistency is complicated by the existence of multiple
storage controllers each controlling multiple DASDs in a
data processing system. Without sequential consistency,
secondary data inconsistent with primary data would result,
thus corrupting disaster recovery.
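
The grouping steps (a) through (e) set out in the Summary can be sketched as follows. This is a minimal illustration only: it assumes fixed-width intervals measured from time zero and a single stream of already time stamped updates, and the names RecordUpdate, group_by_interval, and first_consistency_group are hypothetical and do not appear in the patent. It does not attempt to model multiple storage controllers or the prefix header format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordUpdate:
    timestamp: float  # time stamp taken from the common timer (assumed units: seconds)
    sequence: int     # hypothetical tie-breaker for updates with equal time stamps
    location: str     # hypothetical label for the physical location of the update
    data: bytes       # the updated record contents


def group_by_interval(updates, interval):
    """Step (d): bucket updates into interval groups of a fixed width."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    groups = defaultdict(list)
    for update in updates:
        groups[int(update.timestamp // interval)].append(update)
    return groups


def first_consistency_group(updates, interval):
    """Step (e): pick the interval group holding the earliest time stamp
    and order its updates by time sequence."""
    groups = group_by_interval(updates, interval)
    if not groups:
        return []
    earliest = min(groups)
    return sorted(groups[earliest], key=lambda u: (u.timestamp, u.sequence))


# Updates arrive at the secondary site out of order.
received = [
    RecordUpdate(0.7, 3, "vol1/track9", b"c"),
    RecordUpdate(1.4, 4, "vol2/track1", b"d"),
    RecordUpdate(0.2, 1, "vol1/track5", b"a"),
    RecordUpdate(0.5, 2, "vol2/track7", b"b"),
]
group = first_consistency_group(received, interval=1.0)
ordered_data = [u.data for u in group]
```

Sorting inside the selected group is what restores the primary site's write order. The update stamped 1.4 falls into a later interval group and would be handled in a subsequent consistency group.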
Remote data duplexing falls into two general categories,
synchronous and asynchronous. Synchronous remote copy
involves sending primary data to the secondary location and
confirming the reception of such data before ending a
primary DASD input/output (I/O) operation (providing a
channel end (CE) and device end (DE) to the primary host).
Synchronous copy, therefore, slows the primary DASD I/O
response time while waiting for secondary confirmation.
Primary I/O response delay is increased proportionately with
the distance between the primary and secondary systems, a
factor that limits the remote distance to tens of kilometers.
Synchronous copy, however, provides sequentially consistent
data at the secondary site with relatively little system
overhead.

Asynchronous remote copy provides better primary application
system performance because the primary DASD I/O
operation is completed (providing a channel end (CE) and
device end (DE) to the primary host) before data is confirmed
at the secondary site. Therefore, the primary DASD
I/O response time is not dependent upon the distance to the
secondary site and the secondary site could be thousands of
kilometers remote from the primary site. A greater amount
of system overhead is required, however, for ensuring data
sequence consistency since data received at the secondary
site will often not be in order of the primary updates. A
failure at the primary site could result in some data being lost
that was in transit between the primary and secondary
locations.

SYNCHRONOUS DATA SHADOWING

Synchronous real time remote copy for disaster recovery
requires that copied DASD volumes form a set. Forming
such a set further requires that a sufficient amount of system
information be provided to the secondary site for identifying
those volumes (VOLSERs) comprising each set and the
primary site equivalents. Importantly, the secondary site
forms a "duplex pair" with the primary site and the secondary
site must recognize when one or more volumes are out
of sync with the set, that is, "failed duplex" has occurred.
Connect failures are more visible in synchronous remote
copy than in asynchronous remote copy because the primary
DASD I/O is delayed while alternate paths are retried. The
primary site can abort or suspend copy to allow the primary
site to continue while updates for the secondary site are
queued, the primary site marking such updates to show the
secondary site is out of sync. Recognizing exception conditions
that may cause the secondary site to fall out of sync
with the primary site is needed in order that the secondary
site be available at any time for disaster recovery. Error
conditions and recovery actions must not make the secondary
site inconsistent with the primary site.

Maintaining a connection between the secondary site and
the primary site with secondary DASD present and
accessible, however, does not ensure content synchronism.
The secondary site may lose synchronism with the primary
site for a number of reasons. The secondary site is initially
out of sync when the duplex pair is being formed and
reaches sync when an initial data copy is completed. The
primary site may break the duplex pair if the primary site is
unable to write updated data to the secondary site in which
case the primary site writes updates to the primary DASD
under suspended duplex pair conditions so that the updating
application can continue. The primary site is thus running
exposed, that is, without current disaster protection copy
until the duplex pair is restored. Upon restoring the duplex
pair, the secondary site is not immediately in sync. After
applying now pending updates, the secondary site returns to
sync. The primary site can also cause the secondary site to
lose sync by issuing a suspend command for that volume to
the primary DASD. The secondary site re-syncs with the
primary site after the suspend command is ended, duplex
pair is re-established, and pending updates are copied.
On-line maintenance can also cause synchronization to be
lost.

When a secondary volume is out of sync with a primary
volume, the secondary volume is not useable for secondary
system recovery and resumption of primary applications. An
out-of-sync volume at the secondary site must be identified
as such and secondary site recovery-takeover procedures
need to identify the out-of-sync volumes for denying application
access (forcing the volumes off-line or changing their
VOLSERs). The secondary site may be called upon to
recover the primary site at any instant wherein the primary
site host is inaccessible; thus the secondary site requires all
pertinent information about a sync state of all volumes. The
secondary storage subsystem, that is, the secondary storage
controllers and DASD, is unable to determine all conditions
that cause the primary site to break synchronism due to
primary site-encountered exceptions. For example, the primary
site may break a duplex pair if the primary site is unable to
access the secondary peer due to a primary I/O path or link
failure that the secondary site is unaware of. In this case the
secondary site shows in-sync state while the primary site
indicates the duplex pair is broken.

External communication may notify the secondary site
that an out-of-sync duplex pair volume exists. This is
realizable by employing a user systems management function.
Primary I/O operations end with channel end/device
end/unit check (CE/DE/UC) status and sense data indicates
the nature of the error. With this form of I/O configuration
an error recovery program (ERP) processes the error and
sends an appropriate message to the secondary processor
before posting the primary application that I/O is complete.
The user is then responsible to recognize the ERP suspend
duplex pair message and secure that information at the
secondary location. When the secondary site is depended
upon to become operational in place of the primary site, a
start-up procedure brings the secondary DASD on-line to the
secondary host wherein sync status stored in the secondary
DASD subsystem is retrieved for ensuring that out-of-sync
volumes are not brought on-line for application allocation.
This sync status merged with all ERP suspend duplex pair
messages gives a complete picture of the secondary out-of-sync
volumes.

Referring now to FIG. 1, a disaster recovery system 10 is
shown having a primary site 14 and a secondary site 15,
wherein the secondary site 15 is located, for example, 20
kilometers remote from the primary site 14. The primary
site 14 includes a host processor or primary processor 1
having an application and system I/O and Error Recovery
Program 2 running therein (hereinafter referred to as I/O
ERP 2). The primary processor 1 could be, for example, an
IBM Enterprise Systems/9000 (ES/9000) processor running
DFSMS/MVS operating software and further may have
several application programs running thereon. A primary
storage controller 3, for example, an IBM 3990 Model 6
storage controller, is connected to the primary processor 1
via a channel 12. As is known in the art, several such primary
storage controllers 3 can be connected to the primary
processor 1, or alternately, several primary processors 1 can
be attached to the primary storage controllers 3. A primary
DASD 4, for example, an IBM 3390 DASD, is connected to
the primary storage controller 3. Several primary DASDs 4
can be connected to the primary storage controller 3. The
primary storage controller 3 and attached primary DASD 4
form a primary substorage system. Further, the primary
storage controller 3 and the primary DASD 4 could be single
integral units.

The secondary site 15 includes a secondary processor 5,
for example, an IBM ES/9000, connected to a secondary
storage controller 6, for example an IBM 3990 Model 3, via
a channel 13. A DASD 7 is further connected to the
secondary storage controller 6. The primary processor 1 is
connected to the secondary processor 5 by at least one
host-to-host communication link 11, for example, channel
links or telephone T1/T3 line links, etc. The primary processor
1 may also have direct connectivity with the secondary
storage controller 6 by, for example, multiple Enterprise
Systems Connection (ESCON) links 9. As a result, the I/O
ERP 2 can communicate, if required, with the secondary
storage controller 6. The primary storage controller 3 communicates
with the secondary storage controller 6 via multiple
peer-to-peer links 8, for example, multiple ESCON
links.

When a write I/O operation is executed by an application
program running in the primary processor 1, a hardware
status channel end/device end (CE/DE) is provided indicating
the I/O operation completed successfully. Primary processor
1 operating system software marks the application
write I/O successful upon successful completion of the I/O
operation, thus permitting the application program to continue
to a next write I/O operation which may be dependent
upon the first or previous write I/O operation having successfully
completed. On the other hand, if the write I/O
operation was unsuccessful, the I/O status of channel end/
device end/unit check (hereinafter referred to as CE/DE/UC)
is presented to the primary processor 1 operating system
software. Having presented unit check, the I/O ERP 2 takes
control obtaining specific sense information from the primary
storage controller 3 regarding the nature of the failed
write I/O operation. If a unique error to a volume occurs then
a unique status related to that error is provided to the I/O
ERP 2. The I/O ERP 2 can thereafter perform new peer-to-peer
synchronization error recovery for maintaining data
integrity between the primary storage controller 3 and the
secondary storage controller 6, or in the worst case, between
the primary processor 1 and the secondary processor 5.

Referring to FIGS. 2 and 3, the error recovery procedure
is set forth. In FIG. 2, a step 201 includes an application
program running in the primary processor 1 sending a data
update to the primary storage controller 3. At step 203 the
data update is written to the primary DASD 4, and the data
update is shadowed to the secondary storage controller 6. At
step 205 the duplex pair status is checked to determine
whether the primary and secondary sites are synchronized.
If the duplex pair status is in a synchronized state, then the
data update is written to the secondary DASD 7 at step 207
while processing then continues at the primary processor 1
via application programs running thereat.

In the case that the duplex pair is in a "failed" state, then
at step 209 the primary storage controller 3 notifies the
primary processor 1 that duplex pair has suspended or failed.
The duplex pair can become "failed" due to communication
failure between the primary storage controller 3 and the
secondary storage controller 6 via communication links 8.
Alternatively, duplex pair can become "failed" due to errors
in either the primary or secondary subsystem. If the failure
is in the communication links 8, then the primary storage
controller 3 is unable to communicate the failure directly to
the secondary storage controller 6. At step 211 the primary
storage controller 3 returns I/O status CE/DE/UC to the
primary processor 1. The I/O ERP 2 quiesces the application
programs hence taking control of the primary processor 1 at
step 213 for error recovery and data integrity before returning
control to the application requesting the write I/O
operation.

FIG. 3 represents steps performed by the I/O ERP 2. The
I/O ERP 2 issues a sense I/O to the primary storage controller
3 at step 221. The sense I/O operation returns
information describing the cause of the I/O error, that is, the
data description information is unique to the storage controllers
or duplex pair operation regarding specific errors. In
the event that the data description information indicates that
the peer-to-peer communication links 8 have failed between
the primary storage controller 3 and the secondary storage
controller 6, then at step 223 the I/O ERP 2 issues a storage
controller level I/O operation against the primary storage
controller 3 and the secondary storage controller 6 indicating
that the affected volume is to be placed in failed synchronous
remote copy state. This secondary storage controller 6 is
able to receive the state of the affected volume from the I/O
ERP 2 via the multiple ESCON links 9 or the host-to-host
communication link 11. The state of the duplex pair operation
is thus maintained at both the primary
processor 1 and the secondary processor 5 in conjunction
with applications running in the primary processor 1. Consoles
18 and 19 are provided for communicating information
from the primary processor 1 and secondary processor 5,
respectively, wherein the I/O ERP posts status information
to both consoles 18 and 19.

Data integrity has been maintained at step 225 upon
successful completion of the failed synchronous remote
copy I/O operation to the primary storage controller 3 and
the secondary storage controller 6. Therefore, if a recovery
is attempted at the secondary site 15 the secondary storage
controller 6 identifies the volume marked "failed synchronous
remote copy" as not being useable until data on that
volume are synchronized with other data in that synchronization
group by data recovery means (conventional data
base logs and/or journals for determining the state of that
data on the volume).

Step 227 tests to determine whether the I/O ERP 2
received successful completion of the I/O operations at the
primary storage controller 3 and the secondary storage
controller 6 on the failed synchronous remote copy status
update. Upon successful completion, the I/O ERP 2 returns
control to the primary processor 1 at step 229. Otherwise
step 231 performs a next level recovery notification which
involves notifying an operator, via the console 18, of the
failed volume and that a status of that volume at either the
primary storage controller 3 or the secondary storage controller
6 may not be correct. The notification is shadowed to
the secondary site 15, via the console 19 or a shared DASD
data set, for indicating the specific volume status there.

An error log recording data set is updated at step 233. This
update is written to either the primary DASD 4 or some
other storage location and is shadowed to the secondary site
15. Having completed the error recovery actions, the I/O
ERP 2, at step 235, posts to the primary application write I/O
operation a "permanent error", causing the primary
application to take its normal "permanent error" recovery for
the failed write I/O operation. Once the error is corrected,
the volume states can be recovered, first to pending (recopy
changed data) and then back to full duplex. The data may
later be re-applied to the secondary DASD 7 once duplex
pair is re-established.

When establishing a duplex pair a volume can be identified
as CRITICAL according to a customer's needs. For a
CRITICAL volume, when an operation results in failing a
duplex pair, a permanent error failure of the primary volume
is reported irrespective of the actual error's location. With
CRIT=Y, all subsequent attempts to write to the primary
DASD 405 of the failed pair will receive a permanent error,
ensuring that no data is written to that primary volume that
cannot also be shadowed to the paired secondary volume.
This permits complete synchronization with the primary
application actions and the I/O data operations when
required.

Consequently, the disaster recovery system 10 described
herein introduces outboard synchronous remote copy such
that a primary host process error recovery procedure having
an I/O order (channel command word (CCW)) may change
a status of a primary and secondary synchronous remote
copy volume from duplex pair to failed duplex thereby
maintaining data integrity for several types of primary and
secondary subsystem errors. Storage based back-up, rather
than application based back-up, wherein data updates are
duplicated in real time has been provided. The disaster
recovery system 10 also attempts several levels of primary/
secondary status updates, including: (1) primary and secondary
storage controller volume status updates; (2) primary
and secondary host processor notification on specific volume
update status via operator messages or error log recording
common data sets; and (3) CRITICAL volume indication,
future updates to the primary volume can be prevented if the
volume pair goes failed duplex. Hence, real time, full error
disaster recovery is accomplished.

ASYNCHRONOUS DATA SHADOWING

Asynchronous remote data shadowing is used when it is
necessary to further increase a distance between primary and
secondary sites for reducing the probability that a single
disaster will corrupt both primary and secondary sites, or
when primary application performance impact needs to be

the same time stamp value. The resolution, and not the
accuracy, of the sysplex timer 407 is critical. The PDM 404,
though shown connected to the sysplex timer 407, is not
required to synchronize to the sysplex timer 407 since write
I/O operations are not generated therein. A sysplex timer 407
is not required if the primary processor 401 has a single time
reference (for example, a single multi-processor ES/9000
system).

A plurality of primary storage controllers 405 are connected
to the primary processor 401 via a plurality of
channels, for example, fiber optic channels. Connected to
each primary storage controller 405 is at least one string of
primary DASDs 406, for example, IBM 3390 DASDs. The
primary storage controllers 405 and the primary DASDs 406
form a primary storage subsystem. Each storage controller
405 and primary DASD 406 need not be separate units, but
may be combined into a single drawer.

The secondary site 431, located for example, some thousands
of kilometers remote from the primary site 421,
similar to the primary site 421, includes a secondary processor
411 having a secondary data mover (SDM) 414
operating therein. Alternatively, the primary and secondary
sites can be the same location, and further, the primary and
secondary data movers can reside on a single host processor
(secondary DASDs may be just over a fire-wall). A plurality
of secondary storage controllers 415 are connected to the
secondary processor 411 via channels, for example, fiber
optic channels, as is known in the art. Connected to the
storage controllers 415 are a plurality of secondary DASDs
416 and a control information DASD(s) 417. The storage
controllers 415 and DASDs 416 and 417 comprise a secondary
storage subsystem.

The primary site 421 communicates with the secondary
site 431 via a communication link 408. More specifically,
the primary processor 401 transfers data and control infor-
minimized. While the distance between primary and sec- mation to the secondary processor 411 by a communications 
ondary sites can now stretch across the earth or beyond, the protocol, for example, a virtual telecommunications access 
synchronization of write updates across multiple DASD method (VTAM) communication link 408. The cornmum- 
volumes behind multiple primary subsystems to multiple 40 cati° n link 408 can be realized by several suitable commu- 
secondary subsystems is substantially more complicated. nication methods, including telephone <T1, T3 lines), radio. 
Record write updates can be shipped from a primary storage radio/telephone, microwave, satellite, etc. 
controller via a primary data mover to a secondary data The asynchronous data shadowing system 400 encom- 
mover for shadowing on a secondary storage subsystem, but passes collecting control data from the primary storage 
the amount of control data passed therebetween must be 45 controllers 405 so that an order of all data writes to the 
miriimized while still being able to re-construct an exact primary DASDs 406 is preserved and applied to the sec- 
order of the record write updates on the secondary system ondary DASDs 416 (preserving the data write order across 
across several storage controllers as occurred on the primary all primary storage subsystems). The data and control infor- 
system across multiple DASD volumes behind several stor- mation transmitted to the secondary site 431, must be 
age controllers. 50 sufficient such that the presence of the primary site 421 is no 

FIG. 4 depicts an asynchronous disaster recovery system longer required to preserve data integrity, 
400 including a primary site 421 and a remote or secondary The applications 402, 403 generate data or record updates, 
site 431. The primary site 421 includes a primary processor which record updates are collected by the primary storagc 
401, for example, an IBM ES/9000 running DFSMS/MVS controllers 405 and read by the PDM 404. The primary 
host software. The primary processor 401 further includes 55 storage controllers 405 each grouped its respective record 
application programs 402 and 403, for example, IMS and updates for an asynchronous remote data shadowing session 
DB2 applications, and a primary data mover (PDM) 404. A and provides those record updates to the PDM 404 via 
common sysplex dock 407 is included in the primary non-specific primary DASD 406 READ requests. Transfer- 
processor 401 for providing a common reference to all ring record updates from the primary storage controllers 405 
applications (402, 403) running therein, wherein all system 60 to the PDM 404 is controlled and optimized by the PDM 404 
clocks or time sources (not shown) synchronize to the for minimizing a niirnber of START I/O operations and time 
sysplex clock 407 ensuring all time dependent processes are delay between each read, yet m a ximiz i ng an amount of data 
properly timed relative to one another. The primary storage transferred between each primary storage controller 405 and 
controllers 406, for example, synchronize to a resolution the primary processor 401. The PDM 404 can vary a time 
appropriate to ensure differentiation between record write 65 interval between non-specific READs to control this primary 
update times, such that no two consecutive write I/O opera- storage controller-host optimization as well as a currency of 
tions to a single primary storage controller 404 can exhibit the record updates for the secondary DASDs 416. 
Collecting record updates by the PDM 404. and transmit- volume 506 is assigned by either the PDM 404 or the SDM

ting those record updates to the SDM 414. while maintaining 414 depending upon performance considerations. A records 

data integrity, requires the record updates to be transmitted read time 507 supplies an operational time stamp that is 

for specific time intervals and in appropriate multiple time common to all primary storage controllers 405 indicating an 

intervals with enough control data to reconstruct the primary 5 end time for the PDM 404 read record set process current 

DASDs 406 record WRITE sequence across all primary interval- 

storage subsystems to the secondary DASDs 416. THe opottonal^^ and the records read time 

Re-constructing the primary DASDs 406 record WRTTB 5a7 ™ used by the PDM 404 to group sets of readreccord 

ir «^™i;ck*h k« «oe*i«« oif sets from each of the primary storage controllers 405. Time 

L t VZ^ * l^i/^SSSS ,n synchronization for grouping sets <Tf read record sets is key 

records from the PDM 404 to the SDM 414. The SDM 414 to J ^ ^ ^ ^ mc PDM 404 could be 

inspects me self describing records for drtermmmg whether s / chronkcd to a ^tinl processing unit (CPU) clock 

any records for a given time interval have been lost or are 1 { ianillg only me PDM 404 DOt attached to the syspelx timer 

incomplete. 407, The PDM 404 does not write record updates, but the 

FIGS. 5 and 6 show a journal record format created by the record updates, as stated previously, must be synchronized 

PDM 404 for each self describing record, including a prefix 15 to a common time source. 

header 500 (FIG. 5), and a record set information 600 (FIG. Referring now to FIG. 6. the record set information 600 

6) as generated by the primary storage controller 405. Each is generated by the primary storage controllers 405 and 

self describing record is further journaled by the SDM 414 collected by the PDM 404. Update Specific Information 

for each time interval so that each self describing record can 601-610, includes a primary device unit address 601 of each 

be applied in time sequence for each time interval to the 20 record indicating the actual primary DASD 406 that the 

secondary DASDs 416 record update occurred on. A cylinder number/head number 

Referring now to FIG. 5, the prefix header 500, which is (C ^J™ ^1^9^ ^ T ^ 

inserted at me front of each record set, includes a total data ^record update. Primary SHD 603, the primary storage 

length 501 for describing the total length of the prefix header ™ ■» fce same as primary SSID 

500 and actual primary record set information 600 that is 25 ***** *Ta^£^*2??v mf< ^ matlOI, 

transmitted to the SDM 414 for each record set An opera- sgf* p^re«H^620foUow. Sequence numbers 

tional time stamp 502 is a time stamp indicating a start time ™* » "fW* a ^ ^ f {f **^ g 

for the operational set that mePDM 404 is currently * c ^^Jf 

r*oce*sin^The operational time stamp 502 is generated by fcrrcd to * c ™"?>- 5™* ^^T^?^ ™ 

Se PDM 404 (according to thesyspiex timer 407) when 30 " » ™*^^**% U * 1*° ^J*™- 

performing a READ RECORD SET function to a set of the bo * P^^cd on each record, the operaUonj uidicators 

£toaiy^age controllers 405. An I/O time 610 (FIG. 6) of f^*?^^ f™ ^ ^ 5**^°^ 

Se primary DASDs 406 write is unique for each primary foUow ; m tn *j£^™'™ ™*^^<*?** 

storage controller 405 READ RECORD SET. The opera- f*™* 5 ™? ^<™*- Scarf argument 607 mdicates 

tional time stamp 502 is common across all storage control- 35 ^ rx>sitio^g iiifcjir^on fe Ae first read record set 

lers data record 620. A sector number 608 identifies that sector 

i nn^nrnnnnrtTr j • a w .1. nnw that the record was updated at Count field 609 describes a 

A READ RTCO^SFrcoramand is issued by the PDM Qumber of spedfic rc cc^data fields 620 that follow. A host 

404 and can be predicated upon one of the following ^ ^ ^ DASD 406 write update 

conditions. ^ occurred is recorded in time of updates 610. Specific record 

(1) Primary storage controller 405 attention interrupt data ^ provides a count/key/data (CKD) field of each 
based upon that primary storage controller predeter- record update. Lastly, the sequence number 630 is compared 
mined threshold; t0 ^ sequence number 605 for indicating whether the entire 

(2) Primary processor 401 timer interrupt based upon a rca d record set was transferred to the PDM 404. 
predetermined time interval; or 45 The update records are handled in software groups called 

(3) Record set information indicates additional informa- consistency groups so that the SDM 414 can copy the record 
tion on outstanding record sets available but not yet updates in the same order they were written at the primary 
read. DASDs 406. The information used for creating the consis- 

Condition (2) uses a timer interval to control how far tency groups (across aU record sets collected from all 

behind the secondary system 431 executes during periods of 50 storage controllers 405) includes the: operational time stamp 

low activity. Condition (3) occurs when the PDM 404 fails 502; time interval group number 503; sequence number 

to drain aU record sets during a processing interval which within group 504 ; primary controller SSID 505 ; records read 

drives further activity for ensuring that the PDM 404 beeps time 507; primary device address 601; the primary SSID 

up with primary storage controller 405 activity. 603; and the status flags 604, The information used for 

A time interval group number 503 is supplied by the PDM ss determining whether all records for a time interval group 

404 to identify a time interval (bounded by operational time have been received for each storage controller 405 at the 

stamp 502 and a records read time 507) for which the current SDM 414 includes the: time interval group number 503; 

record sets belong (sets of records across all primary storage sequence number within group 504; physical controller ID 

controllers 405 for a given time interval group form con- 505; the primary SSID 603; and a total number of read 

sistency groups). A sequence number within group 504 is 60 record sets returned from each primary storage controller 

derived based upon a hardware provided identification (to 405 for each operational time interval. The information 

the PDM 404) of a write sequence order of application necessary to place record updates on the secondary DASDs 

WRITE I/Os for primary storage controller 405 for each 416 equivalently to the primary DASDs 406 record updates 

record set within a given time interval group 503. A primary with full recover possible includes the: secondary target 

SSID (substoragc identification) 505 uniquely identifies the 65 volume 506; CCHH 602; primary DASD write I/O type 606; 

specific primary storage controller of the primary storage search argument 607; sector number 608; count 609; time of 

controllers 405 for each record set. A secondary target updates 610; and the specific record data 620. 
FIGS. 7 and 8 show a state table 700 and a master journal
800, respectively, for describing current journal contents,
which simplifies recovery time and journal transfer time.
The state table 700 provides configuration information, 
collected by and common to the PDM 404 and SDM 414,
and includes primary storage controller session identifiers 
(SSID numbers) and the volumes therein, and the corre- 
sponding secondary storage controller session identifiers 
and the corresponding volumes. Thus the configuration 
information tracks which primary volumes 710 or primary 
DASD extents map to secondary volumes 711 or secondary 
DASD extents. With a simple extension to the state table 700
indicating partial volume extents 712 (CCHH to CCHH), 
partial volume remote copy can be accomplished using the 
same asynchronous remote copy methods described herein, 
but for a finer granularity (track or extent) than full volume. 

The master journal 800 includes: consistency group num- 
ber; location on journal volumes; and operational time 
stamp. The master journal 800 further maintains specific 
record updates as grouped in consistency groups. The state 
table 700 and master journal 800 support disaster recovery, 
and hence must be able to operate in a stand-alone environ- 
ment wherein the primary system 401 no longer exists. 

A time stamp control is placed at the front and back of 
each master journal 800 to ensure that the entire control 
entry was successfully written. The time stamp control is 
further written to the secondary DASDs 417. The control 
elements include dual entries (1) and (2), wherein one entry 
is always a current entry, for example: 

(1) Timestamp Control | Control Info | Timestamp Control

(2) Timestamp Control | Control Info | Timestamp Control

At any point in time either entry (1) or (2) is the current 
or valid entry, wherein a valid entry is that entry with equal 
timestamp controls at the front and back. Disaster recovery
uses the valid entry with the latest time stamp to obtain 
control information. This control information, along with 
state information (environmental information regarding 
storage controllers, devices, and applied consistency 
groups), is used for determining what record updates have 
been applied to the secondary storage controllers 415. 
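
The dual-entry rule above can be sketched as follows. This is a minimal illustration only: the `ControlEntry` type, its field names, and the use of integer timestamps are assumptions made for the example and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ControlEntry:
    """Hypothetical in-memory view of one control entry."""
    front_stamp: int     # timestamp control written before the control info
    control_info: bytes  # the control information itself
    back_stamp: int      # timestamp control written after the control info


def is_valid(entry: ControlEntry) -> bool:
    # An entry is valid only when the front and back timestamp controls
    # are equal, which shows the whole entry was written.
    return entry.front_stamp == entry.back_stamp


def current_entry(entries: Sequence[ControlEntry]) -> Optional[ControlEntry]:
    # Disaster recovery uses the valid entry with the latest timestamp.
    valid = [e for e in entries if is_valid(e)]
    if not valid:
        return None
    return max(valid, key=lambda e: e.front_stamp)
```

Because one of the two entries is always left intact while the other is being rewritten, a write interrupted partway through leaves mismatched timestamps on one entry only, and recovery falls back to the other.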

CONSISTENCY GROUPS 

After all read record sets across all primary storage 
controllers 405 for a predetermined time interval are 
received at the secondary site 431, the SDM 414 interprets 
the received control information and applies the received 
read record sets to the secondary DASDs 416 in groups of 
record updates such that the record updates are applied in the 
same sequence that those record updates were originally
written on the primary DASDs 406. Thus, all primary 
application order (data integrity) consistency is maintained 
at the secondary site 431. This process is hereinafter referred 
to as forming consistency groups. Forming consistency 
groups is based on the following assumptions: (A) applica-
tion writes that are independent can be performed in any
order if they do not violate controller sequence order, (B) 
application writes that are dependent must be performed in 
timestamp order, hence an application cannot perform a 
dependent write number two before receiving control unit 
end, device end from write number one; and (C) a second
write will always be either (1) in a same record set consis- 
tency group as a first write with a later timestamp or (2) in 
a subsequent record set consistency group. 

Referring to FIG. 9, an example of forming a consistency 
group (the consistency group could be formed at either the 
primary site 421 or secondary site 431), for example, for
storage controllers SSID 1, SSID 2, and SSID 3 is shown
(any number of storage controllers can be included but three
are used in this example for clarity). Time intervals T1, T2
and T3 are assumed to occur in ascending order. An opera-
tional time stamp 502 of time interval T1 is established for
storage controllers SSID 1, SSID 2 and SSID 3. The PDM
404 obtains record set data from storage controller SSIDs 1,
2, and 3 for time interval T1-T3. The record sets for SSIDs
1, 2, and 3 for time interval T1 are assigned to time interval
group 1, G1 (time interval group number 503). The sequence
number within group 504 is shown for each SSID 1, 2, and
3, wherein SSID 1 has three updates at 11:59, 12:00, and
12:01, SSID 2 has two updates at 12:00 and 12:02, and SSID
3 has three updates at 11:58, 11:59, and 12:02. Record sets
of time intervals T2 and T3 are listed but example times of
updates are not given for simplicity.

Consistency group N can now be formed based upon the
control information and record updates received at the
secondary site 431. In order to ensure that no record update
in time interval group number one is later than any record
update of time interval group number two, a min-time is
established which is equal to the earliest read record set
time of the last record updates for each storage controller
SSID 1, 2, and 3. In this example then, min-time is equal to
12:01. Any record updates having a read record set time
greater than or equal to min-time are included in the consis-
tency group N+1. If two record update times to a same
volume were equal, though unlikely given sufficient reso-
lution of the sysplex timer 407, the record update having the
earlier sequence number within the time interval group N is
kept with that group for consistency group N. The record
updates are now ordered based upon read record set times.
Record updates having equal times will cause the record
update having the lower sequence number to be placed before
the later sequence numbered record update. Alternatively,
record updates having equal time stamps, but to differing
volumes, may be ordered arbitrarily as long as they are kept
in the same consistency group.
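
The min-time computation for the FIG. 9 example can be sketched as follows. This is an illustrative sketch only: the `Update` type and function names are hypothetical, sequence numbers are assumed to count from one within each controller, and updates falling exactly at min-time are deferred to the next group, following the "greater than or equal to" wording above.

```python
from dataclasses import dataclass
from datetime import time
from typing import List, Tuple


@dataclass(frozen=True)
class Update:
    ssid: int   # primary storage controller
    seq: int    # sequence number within group (assumed per controller)
    when: time  # time of the record update


def min_time(updates: List[Update]) -> time:
    # Take the last update time of each controller, then the earliest of those.
    last = {}
    for u in updates:
        if u.ssid not in last or u.when > last[u.ssid]:
            last[u.ssid] = u.when
    return min(last.values())


def split_group(updates: List[Update]) -> Tuple[time, List[Update], List[Update]]:
    cutoff = min_time(updates)

    def order(u: Update):
        # Time first; equal times fall back to the lower sequence number.
        return (u.when, u.seq)

    current = sorted((u for u in updates if u.when < cutoff), key=order)
    deferred = sorted((u for u in updates if u.when >= cutoff), key=order)
    return cutoff, current, deferred


# Update times for time interval T1 as given in the example.
EXAMPLE = [
    Update(1, 1, time(11, 59)), Update(1, 2, time(12, 0)), Update(1, 3, time(12, 1)),
    Update(2, 1, time(12, 0)), Update(2, 2, time(12, 2)),
    Update(3, 1, time(11, 58)), Update(3, 2, time(11, 59)), Update(3, 3, time(12, 2)),
]
```

For `EXAMPLE`, the last updates are 12:01, 12:02, and 12:02, so `min_time` returns 12:01, matching the value stated in the text.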

If a primary storage controller 405 fails to complete a
response to a read record set during a specified time interval
then a consistency group cannot be formed until that primary
storage controller 405 completes. In the event that the
primary storage controller 405 fails to complete its
operation, then a missing interrupt results causing a system
missing interrupt handler to receive control and the opera-
tion will be terminated. On the other hand if the primary
storage controller 405 timely completes the operation then
the I/O will be driven to completion and normal operation
will continue. Consistency group formation expects that
write operations against the primary storage controllers 405
will have time stamps. Some programs, however, will cause
writes to be generated without time stamps, in which case
the primary storage controller 405 will return zeros for the
time stamp. Consistency group formation can bound those
records without time stamps based upon the timestamp that
the data was read. If too many record updates without time
stamps occur over a time interval such that the record
updates are not easily bounded by consistency group times,
then an error that the duplex volumes are out of synchroni-
zation may result.

FIGS. 10 and 11 are flow diagrams presenting the method
of forming consistency groups. Referring to FIG. 10, the
process starts at step 1000 with the primary site 421 estab-
lishing remote data shadowing to occur. At step 1010 all
application I/O operations are time stamped using the sys-
plex timer 407 as a synchronization clock (FIG. 4). The
PDM 404 starts a remote data shadowing session with each
primary storage controller 405 at step 1020 which includes
identifying those primary volumes that will have data or
records shadowed. Record set information 600 is trapped by
the primary storage controllers 405 for each application
WRITE I/O operation (see FIG. 6) by step 1030.

Step 1040 involves the PDM 404 reading the captured
record set information 600 from each primary storage con-
troller 405 according to a prompt including an attention
message, a predetermined timing interval, or a notification
of more records to read as described earlier. When the PDM
404 begins reading record sets, at step 1050, the PDM 404
prefixes each record set with a prefix header 500 (see FIG.
5) for creating specific journal records (a journal record
includes the prefix header 500 and the record set information
600). The journal records contain the control information
(and records) necessary for forming consistency groups at
the secondary site 431 (or at the primary site 421).

At step 1060 the PDM 404 transmits the generated journal
records to the SDM 414 via the communications link 408 (or
within the same data mover system if the consistency groups
are formed therein). The SDM 414 uses the state table 700
at step 1070 to gather the received record updates by group
and sequence numbers for each time interval group and
primary storage controller 405 established for the data
shadowing session. The SDM 414 inspects the journal
records at step 1080 to determine whether all record infor-
mation has been received for each time interval group. If the
journal records are incomplete, then step 1085 causes the
SDM 414 to notify the PDM 404 to resend the required
record sets. If the PDM 404 is unable to correctly resend,
then the duplex volume pair is failed. If the journal records
are complete, then step 1090 is performed which encom-
passes the SDM 414 forming the consistency groups.

Referring to FIG. 11, steps 1100-1160 representing step
1090 (FIG. 10) for forming consistency groups is shown.
Consistency group formation starts at step 1100 wherein
each software consistency group is written to an SDM 414
journal log ("hardened") on the secondary DASD 417 (FIG.
4). Step 1110 performs a test for determining whether the
time interval groups are complete, that is, each primary
storage controller 405 must have either presented at least
one read record set buffer or have confirmation from the
PDM 404 that no such record updates were placed in the
record set buffer, and all read record set buffers with data (or
null) must have been received by the SDM 414. If a time
interval group is incomplete, then step 1110 retries reading
the record sets from the primary storage controller 405 until
the required data is received. If errors occur, a specific
duplex volume pair or pairs may be failed. Having received
complete time interval groups, step 1120 determines a first
consistency group journal record. The first (or current)
consistency group journal record is that record which con-
tains the earliest operational time stamp 502 and the earliest
time of update 610 of all records having equal operational
time stamps 502.

Step 1130 inspects the records contained in the current
consistency group journal record to determine which record
will be the last record to be included therein (some records
will be dropped and included in the next consistency group
journal record). The last record in the current consistency
group journal record is determined as a minimum update
time (min-time) of the maximum update times for each
primary storage controller 405 (that is, the last update of
each primary storage controller 405 is compared and only
the earliest of these remains in the current consistency group
journal record).

Those remaining record updates in the current consistency
group journal record are ordered according to time of update
610 and sequence number within group 504 by step 1140. A
primary storage controller 405 that had no record updates
does not participate in the consistency group. At step 1150,
the remaining record updates of the current consistency
group (having update times later than min-time) are passed
to the next consistency group. Each sequence number within
a group 504 should end with a null buffer indicating that all
read record sets have been read for that operational time
interval. If the null buffer is absent, then the step 1120 of
defining the last record in the current software consistency
group, coupled with the records read time 507 and time of
update 610 can be used to determine the proper order of the
application WRITE I/O operations across the primary stor-
age controllers 405.

Step 1160 represents a back-end of the remote data
shadowing process wherein specific write updates are
applied to secondary DASDs 416 under full disaster recov-
ery constraint. If when writing the updates to the secondary
DASDs 416 an I/O error occurs, or the entire secondary site
431 goes down and is re-initialized, then the entire consis-
tency group that was in the process of being written can be
re-applied from the start. This permits the remote shadowing
to occur without having to track which secondary DASDs
416 I/Os have occurred, which I/Os have not occurred, and
which I/Os were in process, etc.

SECONDARY I/O WRITES

A key component of step 1160 is that the SDM 414 causes
the records to be written efficiently to the secondary DASDs
416 so that the secondary site 431 can keep pace with the
primary site 421. The requisite efficiency is accomplished, in
part, by concurrently executing multiple I/O operations to
different secondary DASDs 416. Serially writing one sec-
ondary DASD 416 at a time would cause the secondary site
431 to fall too far behind the primary site 421. Yet more
efficiency is gained at the secondary site 431 by writing the
records for a single consistency group destined for a single
secondary device via a single channel command word
(CCW) chain. Within each CCW chain, the I/O
operations therein can be further optimized as long as those
I/O operations to each secondary DASD 416 data track are
maintained in the order of occurrence on the primary vol-
umes.

Optimizing secondary I/O operations for specific consis-
tency groups and within single CCW chains is based in part
upon the pattern of primary write I/O operations, and in part
upon the physical characteristics of the secondary DASDs
416. Optimization may vary somewhat depending upon
whether secondary DASD 416 is count/key/data (CKD),
extended count/key/data (ECKD), fixed block architecture
(FBA), etc. Consequently, a number of WRITE I/Os (m) to
a primary DASD 406 volume during a given time interval
can be reduced to a single START I/O operation to a
secondary DASD 416 volume. This optimization of the
number of START I/Os to the secondary storage controllers
415 of m:1 can allow the secondary DASDs 416 to catch up
with and thereby closer shadow the record updates at the
primary site 421.

A key to successful remote data shadowing, and hence
secondary I/O optimization, is minimizing unrecoverable
errors in any of the concurrent multiple I/O operations to
secondary DASDs 416 so that consistent copies are avail-
able for recovery. A failure in a given secondary write could
permit a later dependent write to be recorded without the
conditioning write (e.g., a log entry indicating that a data
base record has been updated when in reality the actual 
update write for the data base had failed violates the 
sequence integrity of the secondary DASD 416 copy). 

A failed secondary 416 copy is unusable for application
recovery until that failure to update has been recovered. The
failed update could be corrected by having the SDM 414
request a current copy from the PDM 404. In the meantime
the secondary data copy is inconsistent and hence unusable
until the PDM 404 responds with the current update and all
other previous updates are processed by the SDM 414. The
time required to recover the failed update typically presents 
an unacceptably long window of non-recovery for adequate 
disaster recovery protection. 

Effective secondary site 431 I/O optimization is realized 
by inspecting the data record sets to be written for a given 
consistency group and building chains based upon rules of 
the particular secondary DASD 416 architecture, for 
example, an ECKD architecture. The optimization technique 
disclosed herein simplifies recovery from I/O errors such 
that when applying a consistency group, if an I/O error 
occurs, the CCW chain can be re-executed, or in the event
of a secondary initial program load (IPL) recovery, the entire 
consistency group can be re-applied without data loss. 

FIG. 12 summarizes full consistency group recovery 
(FCGR) rules for building CCW chains for all WRITE I/O 
combinations for an ECKD architecture, wherein CCHHR 
record format is used (cylinder number, head number, record
number). FIG. 12 is created by inspecting each possible 
combination of WRITE I/O operations to a DASD track 
within a consistency group's scope. The FCGR rules of FIG. 
12, described in FIGS. 13A and 13B, are then followed to 
govern data placement (secondary DASD 416 I/O write 
CCW chains) for yielding full recovery for an error in 
applying a consistency group. The FCGR rules depicted in 
FIG. 12 would be extended appropriately as new WRITE 
I/O operations are added. These rules can exist in hardware 
or software at the secondary site 431. The FCGR rules 
advantageously reduce READ record set to a same DASD
track analysis to an inspection of the primary DASD 406 
WRITE I/O type, search argument, and count and key fields. 

If a DASD track is written without inspecting the con- 
sistency group write operations as shown in FIG. 12, then 
previously written data records potentially cannot be 
re-written. For example, assume that a chain includes: 

WRITE UPDATE to record five; and 

FORMAT WRITE to record one, 
wherein record one and record five occur on the same DASD 
track with record one preceding record five. Record five is 
updated by an UPDATE WRITE CCW and a FORMAT 
WRITE I/O CCW updates record one erasing a remainder of 
the track thus deleting record five. If this chain had to be 
re-executed, a LOCATE RECORD CCW that will position
to the beginning of record five will no longer have a 
positioning point (record five no longer exists), and the chain 
is not fully recoverable from the beginning. Since the write 
operations have already been successful at the primary site 
421, always being able to apply an entire consistency group 
on the secondary DASD 416 is required to maintain data 
consistency and integrity. 
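
The re-execution hazard in this example can be illustrated with a small model. This is a deliberately simplified sketch and not the FCGR rules of FIG. 12: it assumes a track is just a set of record numbers, that an update needs its target record to exist for positioning, and that a format write keeps the records before its target and erases everything after it. The operation names and functions are hypothetical.

```python
from typing import Iterable, Set, Tuple

# (operation kind, record number); kinds are "update" or "format"
Op = Tuple[str, int]


def apply_chain(track: Set[int], chain: Iterable[Op]) -> Set[int]:
    """Apply a chain to a track and return the records left on it."""
    records = set(track)
    for kind, rec in chain:
        if kind == "update":
            # Positioning fails if the target record no longer exists.
            if rec not in records:
                raise LookupError(f"cannot position to record {rec}")
        elif kind == "format":
            # Writes the record and erases the remainder of the track.
            records = {r for r in records if r < rec} | {rec}
        else:
            raise ValueError(f"unknown operation {kind!r}")
    return records


def is_reexecutable(track: Set[int], chain: Iterable[Op]) -> bool:
    """True if the chain can be run again after it has already run once."""
    ops = list(chain)
    after_first_run = apply_chain(track, ops)
    try:
        apply_chain(after_first_run, ops)
    except LookupError:
        return False
    return True
```

With a track holding records one through five, the chain `[("update", 5), ("format", 1)]` runs once but cannot be re-executed, because record five is gone by the second pass. A real check would also have to consider a chain that failed partway through, which this model does not cover.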

FIG. 14, steps 1410 through 1470, provides more detail as 
to the process represented by step 1160 of FIG. 11, while 
using the FCGR rules defined in FIG. 12. At step 1410 the 
SDM 414 divides the records of the current consistency 
group into two categories. A first category includes I/O 
orders directed to a same secondary DASD volume, and a 
second category includes I/O orders of those records in the
first category that are directed to a same CCHH (i.e., records
being updated to a same DASD track).
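
The two categories of step 1410 amount to grouping by volume and then by track within a volume. A minimal sketch, assuming a hypothetical `Record` type in which a volume is named by a string and a CCHH is a (cylinder, head) pair:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    volume: str             # secondary DASD volume (hypothetical naming)
    cchh: Tuple[int, int]   # (cylinder number, head number)
    payload: bytes = b""


def categorize(records: Iterable[Record]):
    """Return (records per volume, records per track within a volume)."""
    by_volume: Dict[str, List[Record]] = defaultdict(list)
    by_track: Dict[Tuple[str, Tuple[int, int]], List[Record]] = defaultdict(list)
    for rec in records:
        # First category: I/O orders directed to the same volume.
        by_volume[rec.volume].append(rec)
        # Second category: those directed to the same CCHH on that volume.
        by_track[(rec.volume, rec.cchh)].append(rec)
    return dict(by_volume), dict(by_track)
```

Records are appended in arrival order, so each list keeps the order in which the updates appear in the consistency group.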

Having categorized the records of the current consistency 
group, step 1420 conforms application WRITE I/Os and 
SDM 414 WRITE I/Os to the architecture of the secondary 
DASDs 416, for example, to ECKD architecture FCGR 
rules (see FIG. 12) for identifying data placement on a track 
and track/record addressing. The SDM 414 groups second- 
ary DASD WRITE I/O operations to the same volumes into 
single I/O CCW chains at step 1430. Step 1440 involves 
moving the head disk assembly (HDA) of each secondary 
DASD 416 according to search arguments and specific 
record data (CKD fields) for the actual secondary DASD 
416 writes. 

Step 1450 compares READ SET BUFFERS one and two 
for those records making up the second categories (there 
typically will be a plurality of second categories, one for 
each track receiving records), using the FCGR rules of FIG. 
12 for determining whether a subsequent write operation 
invalidates a previous write operation or DASD search 
argument (positioning at a record that is now erased, etc). 
The READ SET BUFFERS one and two contain adjacent 
read record sets. Following the FCGR rules ensures that the 
SDM 414 can re-write an entire consistency group, in the 
event of an error, without re-receiving record updates from 
the primary site 421. After the SDM 414 applies the current 
consistency group to the secondary DASD 416, step 1460 
updates the state table (FIG. 7) and the master journal (FIG. 
8). 
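
The comparison of step 1450 can be sketched for the single rule illustrated earlier, namely that a format write erases the remainder of its track. The complete FCGR rules of FIG. 12 are not reproduced here, so the function below is an assumption-laden illustration rather than the rule set itself; `find_invalidated_writes` is a hypothetical name.

```python
def find_invalidated_writes(track_writes):
    """Return (earlier_index, later_index) pairs for one track.

    `track_writes` is an ordered list of (op, record_no) tuples. A pair is
    reported when a later FORMAT write erases the record that an earlier
    write positioned to, so the earlier write could not be re-executed.
    """
    conflicts = []
    for later_idx, (op, record_no) in enumerate(track_writes):
        if op != "FORMAT":
            continue
        for earlier_idx in range(later_idx):
            _, earlier_record = track_writes[earlier_idx]
            # Records after the formatted record are erased from the track.
            if earlier_record > record_no:
                conflicts.append((earlier_idx, later_idx))
    return conflicts


writes = [("UPDATE", 5), ("FORMAT", 1), ("UPDATE", 1)]
conflicts = find_invalidated_writes(writes)
```

A non-empty result for a track indicates that the writes to that track cannot simply be replayed from the beginning after an error.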

The remote copy process continues in real time as step 
1470 gets a next consistency group (which becomes the 
current consistency group) and returns processing to step 
1410. The remote copy process will stop if the primary site 
421 to secondary site 431 communication terminates. The 
communication may terminate if volume pairs are deleted 
from the process by the PDM 404, the primary site is 
destroyed (disaster occurs), an orderly shutdown is 
performed, or a specific takeover action occurs at the sec- 
ondary site 431. Consistency groups journaled on the sec- 
ondary site 431 can be applied to the secondary DASD 416 
during a takeover operation. The only data lost is that data 
captured by the primary site 421 that has not been com- 
pletely received by the SDM 414. 

In summary, synchronous and asynchronous remote data 
duplexing systems have been described. The asynchronous 
remote data duplexing system provides storage based, real 
time data shadowing. A primary site runs applications gen- 
erating record updates, and a secondary site, remote from the 
primary site, shadows the record updates and provides 
disaster recovery for the primary site. The asynchronous 
remote data duplexing system comprises a sysplex timer for
synchronizing time dependent processes in the primary site, 
and a primary processor at the primary site for running the 
applications, the primary processor having a primary data 
mover therein. A plurality of primary storage controllers are 
coupled to the primary processor for issuing write I/O 
operations for each record update, each primary storage 
controller DASD write I/O operation being synchronized to
the sysplex timer. A plurality of primary storage devices
receive the write I/O operations and store the record updates 
therein accordingly. The primary data mover collects record 
set information from the plurality of primary storage controllers
for each record update and appends a prefix header
to a predetermined group of record set informations. The
prefix header and predetermined group of record set infor- 
mations form the self describing record sets. Each record set 
information includes a primary device address, a cylinder
number and head number (CCHH), a record update
sequence number, a write I/O type, a search argument, a
sector number, and a record update time. The prefix header
includes a total data length, an operational time stamp, a
time interval group number, and a records read time. A
secondary processor at the secondary site has a secondary
data mover, the secondary data mover receiving the self
describing record sets from the primary site. A plurality of
secondary storage controllers are coupled to the secondary
processor, and a plurality of secondary storage devices are
coupled to the secondary storage controllers for storing the
record updates copies. The secondary data mover determines
whether the transmitted self describing record sets are
complete and forms consistency groups from the self
describing record sets and provides the record updates from
each consistency group to the plurality of secondary storage
controllers for writing to the plurality of secondary storage
devices in an order consistent with a sequence that the record
updates were written to the plurality of primary storage
devices.

While the invention has been particularly shown and
described with reference to preferred embodiments thereof,
it will be understood by those skilled in the art that various
changes in form and details may be made therein without
departing from the spirit and scope of the invention. For
example, the consistency groups have been described as
being formed by the secondary data mover based upon
received self describing record sets; however, the consistency
groups could be formed at the primary site based upon
write record sets or elsewhere in the secondary site. The
formats of the storage devices at the primary and secondary
sites need not be identical. For example, CKD records could
be converted to fixed block architecture (FBA) type records,
etc. Nor are the storage devices meant to be limited to DASD
devices.

What is claimed is:

1. In a system providing remote data shadowing for
disaster recovery purposes, the system including a primary
site having a primary processor running a primary data
mover and applications generating record updates, the primary
processor coupled to a primary storage subsystem
having storage devices for storing the record updates according
to write I/O operations issued by the primary processor
to the primary storage subsystem, the primary site further
including a common system timer for synchronizing time
dependent operations in the primary site, the system further
including a secondary site having a secondary processor
communicating with the primary processor, and a secondary
storage subsystem for storing copies of the record updates in
sequence consistent order, a method for shadowing the
record updates in sequence consistent order comprising
steps of:

(a) time stamping each write I/O operation in the primary
storage subsystem;

(b) capturing record set information for the record updates
from the primary storage subsystem;

(c) reading into the primary data mover the record updates
and the record set information to form record sets;

(d) prefixing each of the record sets with a header to create
self describing record sets, the self describing record
sets to be used by the secondary processor to re-create
a sequence of the write I/O operations at the secondary
site;

(e) transmitting the self describing record sets to the
secondary processor in time interval groups according
to predetermined time intervals;

(f) forming consistency groups from the time interval
groups of the self describing record sets, the record
updates being ordered within the consistency groups
based upon time sequences of the write I/O operations
to the primary storage subsystem; and

(g) shadowing the record updates of each consistency
group to the secondary storage subsystem in a sequence
consistent order.

2. The method according to claim 1 wherein the record
sets are transmitted to the secondary processor asynchronously.

3. The method according to claim 1 wherein the step (f)
is performed at the secondary site.

4. The method according to claim 1 wherein the step (e)
further includes determining at the secondary site whether
each received self describing record set is complete.

5. The method according to claim 4 wherein the step (e)
includes requesting the primary site to retransmit
[illegible] record updates if the primary site determined a
[illegible] record [illegible] incomplete.

6. The method according to claim 1 further comprising a
step (h) determining at the secondary site whether each time
interval group is complete.

7. The method according to claim 6 wherein the step (h)
further includes requesting the primary site to re-send a
missing record set if the secondary site determined that the
time interval group was incomplete.

8. The method according to claim 1 wherein the step (b)
identifies, in the record set information, a physical location
on the primary storage devices where each record update is
stored.

9. The method according to claim 8 wherein the step (b)
identifies, in the record set information, a sequence and time
of update of each record update stored to the primary storage
devices within the session.

10. The method according to claim 1 wherein the step (d)
identifies, in the header, an interval group number for
the session and sequence within group for each record
update.

11. A remote data shadowing system including a primary
site and a secondary site, the secondary site asynchronously
shadowing record updates of the primary site in real time for
disaster recovery purposes, the record updates generated by
applications running at the primary site, the primary site
comprising:
a sysplex timer;
a primary processor running the applications generating
the record updates and issuing a corresponding write
I/O operation for each record update, the primary
processor having a primary data mover therein;
a plurality of primary storage controllers directed to store
the record updates, the plurality of primary storage
controllers executing the issued write I/O operation for
each record update; and
a plurality of primary storage devices receiving and
storing the record updates therein according to the
corresponding write I/O operations,

wherein the primary processor and each write I/O are
time-stamped by the primary processor, as synchronized
by the sysplex timer, such that write I/O operations
are accurately sequence ordered relative to each
other, the primary data mover collecting sets of record
updates and combining each record set information as
provided by each one of the plurality of primary storage
controllers with the corresponding record update, each 
record set information including a relative sequence 
and time of each corresponding write I/O operation, the 
primary data mover collecting record updates into time
interval groups and inserting a prefix header to each
time interval group, wherein the prefix header includes
information identifying the record updates included in
each time interval group, each record set information
and prefix header combined for creating self describing
record sets, the self describing record sets being transmitted
to the secondary site, wherein the self describing
record sets provide information adequate for the secondary
site to shadow the record updates therein in
sequence consistent order without further communications
from the primary site.

12. The remote data shadowing system according to claim
11 wherein the primary data mover forms time interval
groups by establishing a session with all primary storage
controllers identified to participate in remote data shadowing.

13. The remote data shadowing system according to claim
12 wherein the primary data mover collects record set
information from all the identified primary storage controllers.

14. The remote data shadowing system according to claim
13 wherein the primary processor transmits the self describing
records to the secondary site.

15. The remote data shadowing system according to claim
14 wherein the secondary site forms consistency groups
from the transmitted self describing records.

16. The remote data shadowing system according to claim 
13 wherein the primary data mover forms consistency 
groups from the self describing records. 

17. The remote data shadowing system according to claim
13 wherein the primary data mover creates a state table 
journaling record updates and cross referencing a storage 
location of each record update on the primary system and the 
secondary system. 

18. The remote data shadowing system according to claim
13 wherein the plurality of primary storage devices are 
direct access storage devices (DASDs). 

19. The remote data shadowing system according to claim
15 wherein each record set information comprises:
a primary device address;
a cylinder number and head number (CCHH);
a record update sequence number;
a write I/O type;
a search argument;
a sector number; and
a record update time.

20. The remote data shadowing system according to claim
18 wherein the prefix header comprises:
a total data length;
an operational time stamp;
a time interval group number; and
a records read time.

21. An asynchronous remote data duplexing system providing
storage based, real time data shadowing, including a
primary site running applications generating record updates 
and having a secondary site, remote from the primary site, 
the secondary site shadowing the record updates and pro- 
viding disaster recovery for the primary site, the asynchro- 
nous remote data duplexing system comprising: 
a sysplex timer for synchronizing time dependent pro- 
cesses in the primary site; 
a primary processor at the primary site for running the 
applications and issuing write I/O operations for cor- 
responding record updates, the primary processor hav- 
ing a primary data mover therein; 
a plurality of primary storage controllers receiving the
write I/O operation for each record update, each pri- 
mary storage controller write I/O operation synchro- 
nized to the sysplex timer by the primary processor; 
a plurality of primary storage devices storing the record 
updates therein according to the corresponding write 
I/O operation, 

wherein the primary data mover collects record set infor- 
mation from the plurality of primary storage controllers 
for each record update and appends a prefix header to 
a predetermined group of record set informations, the
prefix header and predetermined record set information 
groups forming self describing record sets, each record 
set information including a primary device address, a 
cylinder number and head number (CCHH), a record 
update sequence number, a write I/O type, a search 
argument, a sector number, and a record update time, 
and wherein the prefix header includes a total data 
length, an operational time stamp, a time interval group 
number, and a records read time; 

a secondary processor at the secondary site having a 
secondary data mover, the secondary data mover 
receiving the self describing record sets from the pri- 
mary site; 

a plurality of secondary storage controllers coupled to the
secondary processor; and
a plurality of secondary storage devices storing the record
updates,

wherein the secondary data mover determines whether the 
transmitted self describing record sets are complete and 
forms consistency groups from the self describing 
record sets and provides the record updates from each 
consistency group to the plurality of secondary storage 
controllers for writing to the plurality of secondary 
storage devices in an order consistent with a sequence 
that the record updates were written to the plurality of 
primary storage devices. 